The violent crime rate in U.S increased by 3.4 percent nationwide in 2016 in US. As an international student, as well as a New Yorker, the public safety in NYC is always a concern to us, especially after the recent terrorists attack near the World Trade Center. Thus, our group decided to make a deeper investigation of the crime data and seek out some underlying reasons which led to the increase of crime rate.
The New York City Police Department provides overall crime dataset. NYPD also established a CompStat model, called CompStat 2.0, providing greater specificity about crimes through an online interactive experience.\ On the official website of new york city, there is also a Crime Map which enables people to view crime by precinct. This map includes crimes of seven major felonies.
Since the dataset has 341716, 9, we randomly sample 50000 observations and creat an interactive map showing locations where the crimes in New York City occured:
sample <- nyc_crime_2017[sample(1:nrow(nyc_crime_2017), 50000,replace=FALSE),]
sample %>%
mutate(text_label = str_c("Offense desc:", ofns_desc, ' Boro: ', boro)) %>%
plot_ly(x = ~longitude, y = ~latitude, type = "scatter", mode = "markers",
alpha = 0.5,
color = ~ofns_type,
text = ~text_label)
*Collect historic data of crimes
nyc_hist_vio = read_excel("./historic/violation-offenses-2000-2016.xls", range = "A4:R6") %>%
mutate(ofns_type = "VIOLATION")
nyc_hist_felony_7 = read_excel("./historic/seven-major-felony-offenses-2000-2016.xls", range = "A5:R12") %>%
mutate(ofns_type = "FELONY")
nyc_hist_felony = read_excel("./historic/non-seven-major-felony-offenses-2000-2016.xls", range = "A5:R13") %>%
mutate(ofns_type = "FELONY")
nyc_hist_mis = read_excel("./historic/misdemeanor-offenses-2000-2016.xls", range = "A4:R21")%>%
mutate(ofns_type = "MISDEMEANOR")
We combine the information of crimes in past 16 years.
nyc_crime_hist = nyc_hist_mis %>%
full_join(nyc_hist_felony) %>%
full_join(nyc_hist_felony_7) %>%
full_join(nyc_hist_vio) %>%
mutate(ofns_type = as.factor(ofns_type), ofns_desc = OFFENSE) %>%
select(-OFFENSE)
nyc_crime_hist = nyc_crime_hist %>%
gather(key = year, value = count, "2000":"2016") %>%
group_by(year, ofns_type) %>%
summarize(n = sum(count)/12) %>%
full_join(nyc_crime_2017 %>%
group_by(ofns_type) %>%
summarize(n = n()/10) %>%
mutate(year = "2017")) %>%
ungroup()
nyc_crime_hist %>%
mutate(year = as.numeric(year)) %>%
ggplot(aes(x = year, y = n, fill = ofns_type)) + geom_bar(stat = "identity")
Bar chart showing crime number and offense type in different boro:
barplot = nyc_crime_2017 %>%
mutate(boro = fct_infreq(boro)) %>%
ggplot(aes(x = boro, fill = ofns_type)) + geom_bar()
ggplotly(barplot)
The count of different type of crimes based on 2017 data
crime_tidy2 = nyc_crime_2017 %>%
group_by(date, ofns_type) %>%
summarize(crime_count = n())
ggplot(crime_tidy2, aes(x = date, y = crime_count, color = ofns_type)) +
geom_point(alpha = .6) + geom_smooth() +
theme(legend.position = "bottom")
We then focused on crime data of current year.
Make a plot of crime count versus hour in a day and group by boro.
nyc_crime_2017 %>%
mutate(hour = hour(time)) %>%
group_by(hour, boro) %>%
summarize(n = n()) %>%
ggplot(aes(x = hour, y = n, color = boro)) + geom_point(alpha = 0.5) + geom_line()
Make a crime rate plot based on 2017 data
crime_tidy = nyc_crime_2017 %>%
separate(date, into = c("year", "month","day"), sep = "-") %>%
select(-year, -day) %>%
group_by(month,boro) %>%
summarize(crime_count = n())
crimetotal = ggplot(crime_tidy, aes(x = month, y = crime_count, color = boro)) +
geom_point() + geom_path(aes(group = boro)) +
theme(legend.position = "bottom")
crime_rate = crime_tidy %>%
mutate(popluation = recode(boro, "BRONX" = 1455720,
"BROOKLYN" = 2629150,
"MANHATTAN" = 1643734,
"QUEENS" = 2333054,
"STATEN ISLAND" = 476015)) %>%
mutate(crime_rate = (crime_count/popluation)*100000)
crimerate = ggplot(crime_rate, aes(x = month, y = crime_rate, color = boro)) +
geom_point() + geom_path(aes(group = boro)) +
theme(legend.position = "bottom")
library(gridExtra)
grid.arrange(crimetotal, crimerate, ncol = 2)
nyc_crime = read_csv("./NYPD_Complaint_Data_Current_YTD.csv") %>%
clean_names() %>%
select(boro = boro_nm)
crime_number = nyc_crime %>%
group_by(boro) %>%
summarise(n = n())
population = read_csv("./NYC_Population_by_Borough.csv") %>%
mutate(boro = Borough) %>%
select(-Borough)
nyc_crime_population = left_join(population, crime_number, by = "boro") %>%
clean_names() %>%
mutate(population = as.numeric(population)) %>%
mutate(crime_rate = n / population * 100000)
url = "http://www.baruch.cuny.edu/nycdata/income-taxes/med_hhold_income.htm"
household_income = read_html(url)
income_by_region = (household_income %>%
html_nodes(css = "table"))[[1]] %>%
html_table() %>%
.[c(8,9,10,11,12), c(3,4)]
as_tibble()
## # A tibble: 0 x 0
income_by_region = income_by_region %>%
clean_names() %>%
mutate(boro = x3, median_income = x4) %>%
select(-x3, -x4)
income = read_csv("./NYC_Income_by_Borough.csv") %>%
clean_names() %>%
mutate(boro = borough) %>%
select(-borough)
crime_income = left_join(income, nyc_crime_population, by = "boro")
crime_income %>%
ggplot(aes(x = income, y = crime_rate, color = income)) + geom_point(alpha = 0.5) + geom_smooth() +
labs(title = "Corelation between family median income and crime rate in each borough",
x = "Income Range",
y = "Crime rate")
In addition, we have a strong interest in finding potential factors that may associated with criminal rate. In this case, we choose household income level. After reading data from the web, data cleaning and data visualization, we are surprized to see from the scatter plot: Both lower-income borough and higher-income borough have an extremely high crime rate. For example, Bronx borough’s family median income is 35176 dollars, associated with a crime rate of 0.029. That is, we expect 29 crime cases among every 1000 people. In contrast, Family income ranged between 60000 dollars to 70000 dollars tends to have the lowerest crime rate. Taking Queens as an example, we expect only 15 crime cases among every 1000 people.
Top 10 words in of offense description:
library(tidytext)
crime_words = nyc_crime_2017 %>%
select(-longitude, -latitude) %>%
mutate(ofns_desc = str_to_lower(ofns_desc),
ofns_desc = str_replace(ofns_desc, "[2-3]",""),
ofns_desc = as.character(ofns_desc)) %>%
unnest_tokens(word, ofns_desc)
data(stop_words)
crime_word_tidy =
anti_join(crime_words, stop_words)
crime_word_tidy %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_bar(stat = "identity", fill = "blue", alpha = .6) +
coord_flip()
Comparison of distinct words in offense type of violation and felony.
word_ratios = crime_word_tidy %>%
filter(ofns_type %in% c("VIOLATION" , "FELONY")) %>%
count(word, ofns_type) %>%
group_by(word) %>%
filter(sum(n) >= 5) %>%
ungroup() %>%
spread(ofns_type, n, fill = 0) %>%
mutate(
violation_odds = (VIOLATION + 1) / (sum(VIOLATION) + 1),
felony_odds = (FELONY + 1) / (sum(FELONY) + 1),
log_OR = log(felony_odds / violation_odds)
) %>%
arrange(desc(log_OR))
word_ratios %>%
mutate(pos_log_OR = ifelse(log_OR > 0, "felony_odds >violation_odds" ,"violation_odds > felony_odds")) %>%
group_by(pos_log_OR) %>%
top_n(10, abs(log_OR)) %>%
ungroup() %>%
mutate(word = fct_reorder(word, log_OR)) %>%
ggplot(aes(word, log_OR, fill = pos_log_OR)) +
geom_col() +
coord_flip() +
ylab("log odds ratio (felony_odds/violation_odds)") +
scale_fill_discrete(name = "") +
theme(legend.position = "bottom")